Bio, right?


* Who's here for the Kafka talk then?
* Who really cares about the bios?
* Why do you care about them?


* AOL, AltaVista, Sphinx, Samsung, Fed Treasury, Valve, BH


let's start backwards 


* Can you COUNT? As in...
* Can you PREDICT the system performance?

* Do you know the required "world CONSTANTS"?
* Do you know the CURRENT ones?

* Let's BRIEFLY cover all that jazz

* This will probably be a BAD talk... ;(
* ... hopefully inciting some GOOD thinking!


"World constants?!"


but, IT version? 


zee 
principal 
intermezzo 


[Dean2010]


| Operation | Note |
| L1 cache reference | |
| Branch mispredict | |
| L2 cache reference | 14x L1 cache |
| Mutex lock/unlock | |
| Main memory reference | 20x L2 cache, 200x L1 cache |
| Compress 1K bytes with Zippy | |
| Send 1K bytes over 1 Gbps network | |
| Read 4K randomly from SSD* | ~1GB/sec SSD |
| Read 1 MB sequentially from memory | |
| Round trip within same datacenter | |
| Read 1 MB sequentially from SSD* | ~1GB/sec SSD, 4X memory |
| Disk seek | 20x datacenter roundtrip |
| Read 1 MB sequentially from disk | 80x memory, 20X SSD |
| Send packet CA->Netherlands->CA | |


Alas, things decay 


* That was then - 2010, this is now - almost 2023
* [Dean2010] is still a GREAT start, but


* Today it's (UP TO) SOMEWHAT OFF
* Any day it's NOT QUITE COMPLETE
* Today I have UPDATES & ADDITIONS
* And, of course, benchmarks


the 3 cornerstones? 


disk 


let's start easy!


Disk! 


* A single "IO" => reading a certain "atom" (sector)
* Sectors are 4096 bytes now


* HDD = ~100...200 random iops total
* HDD = ~100...200 MB/sec linear... no change there, but


* SSD = ~500K (!!!) total 32T iops, ~50K 1T iops
* SSD = ~3500...5000+ MB/sec linear


Disk! 


* This actually ALREADY differs vs scripture...


* Smaller thing - newer hardware, faster reads
* Bigger thing - UNITS


[Dean2010] 


| Read 4K randomly from SSD* | 150,000 ns | ~1GB/sec SSD |
| Read 1 MB sequentially from SSD* | 1,000,000 ns | ~1GB/sec SSD, 4X memory |
| Disk seek | 10,000,000 ns | 20x datacenter roundtrip |
| Read 1 MB sequentially from disk | 20,000,000 ns | 80x memory, 20X SSD |


Disk! 


* "100 MB/sec" vs "10...20,000,000 ns / MB"?!


* Personally, MB/sec, ofc, any day
* Meh option, msec/MB


* But, pick YOUR poison
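
To make the unit mismatch concrete, here is a minimal conversion sketch. The helper names are hypothetical (not from the talk), and 1 MB is treated simply as "one unit of data", so the only assumption is 1 sec = 1e9 ns.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: per-megabyte latency (ns per MB) -> throughput (MB/sec).
double ns_per_mb_to_mb_per_sec(double ns_per_mb) {
    assert(ns_per_mb > 0.0);
    return 1e9 / ns_per_mb;  // 1 sec = 1e9 ns
}

// Hypothetical helper: throughput (MB/sec) -> per-megabyte latency (ns per MB).
double mb_per_sec_to_ns_per_mb(double mb_per_sec) {
    assert(mb_per_sec > 0.0);
    return 1e9 / mb_per_sec;
}

// The "meh option": milliseconds per MB.
double ns_per_mb_to_msec_per_mb(double ns_per_mb) {
    return ns_per_mb / 1e6;  // 1 msec = 1e6 ns
}
```

With these, 20,000,000 ns/MB reads as 50 MB/sec (or 20 msec/MB), and 10,000,000 ns/MB reads as 100 MB/sec.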


moving on 


CPU! 


* Refresher, a few general facts


* CPU reads both data and code from RAM
* CPU caches those reads, multi-layer L1 / L2 / etc
* CPU then works with internal REGISTERS
* And does so in "regular" CLOCK TICKS


* 1 physical multi-core CPU == N logical + shared RAM


CPU! 


* Fun fact, ALL those are just "tips of the icebergs"
* Because complicated caches, associativity, TLB, etc
* Because register renaming and mu-ops and out-of-order
* Because floating frequency and thermal throttling


* Modern CPUs are very complicated
* Do not predict (this was so Pentium III), BENCH!
* Still, a few simple rules'o'thumb DO exist!


CPU! 


* Facts: it's "always" 64-bit, multi-core, SIMD, ...
* RULE OF THUMB: IT'S "ALWAYS" ~3 GHz
* Fact: arithmetic is VERY cheap, 1 tick/op... or less!


* Fact: but, beware of "load-hit-store"!
* How many "ops" for C = A + B * 3?


moving sideways! 


CPU vs the “usual” ops 


* Because who cares about C = A + B * 3
* Even at 5+ ticks that's zeunds billions


* And what are our USUAL SUSPECTS?


CPU vs the “usual” ops 


* Okay, how MANY ticks (or nsecs) do we need for...


* malloc()?
* qsort()?
* hash_find()?
* pthread_mutex_lock()/unlock()?
* fast compress()/decompress()?


CPU vs the “usual” ops 


* Or for even higher level building blocks


* accept() etc overheads?
* db_connect() overheads?
* json_parse() cost?
* protobuf::SerializeToString() cost?
* ...etc, YOU name it... and BENCH it


show me yours 


* My code is mostly C++
* My "blocks" are malloc() etc


* My benchmarks reflect that
* YOUR code builds on them too

[image: "I'm a PROGRAMER, PROGRAMMAR, PROGRAMAR... I write code."]


malloc()


* RULE OF THUMB: ~1 M allocs/sec


* Beware, your mileage may vary GREATLY
* VERY sensitive to size, ie 13 bytes vs 175,000 bytes
* VERY dependent on allocator, ie glibc vs je vs...


Note, avoid 1M allocs/sec! (Pools work fine in 2023)
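
For readers who have not written a pool before, here is a minimal sketch of the idea: one big allocation up front, then O(1) alloc/release through a free list. The class name `FixedPool` and its interface are hypothetical, it is not thread-safe, and it is not the allocator used in the talk's benchmarks.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-size block pool (sketch only, single-threaded).
class FixedPool {
public:
    FixedPool(std::size_t block_size, std::size_t block_count)
        : block_size_(round_up(block_size)),
          storage_(block_size_ * block_count) {
        // Thread every block onto the free list.
        for (std::size_t i = 0; i < block_count; i++)
            release(storage_.data() + i * block_size_);
    }

    // Pops a block off the free list; returns nullptr when the pool is exhausted.
    void* alloc() {
        if (!free_) return nullptr;
        Node* n = free_;
        free_ = n->next;
        return n;
    }

    // Pushes a block back; p must have come from this pool.
    void release(void* p) {
        assert(p != nullptr);
        free_ = new (p) Node{free_};
    }

private:
    struct Node { Node* next; };

    // Blocks must be able to hold a Node and stay suitably aligned.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        if (n < sizeof(Node)) n = sizeof(Node);
        return (n + a - 1) / a * a;
    }

    std::size_t block_size_;
    std::vector<unsigned char> storage_;
    Node* free_ = nullptr;
};
```

Both `alloc()` and `release()` are a couple of pointer moves, which is why a pool sidesteps the size and allocator sensitivity listed above. Whether it actually helps in a given program is, as ever, something to bench.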


qsort()


* RULE OF THUMB: ~12...14 M ints/sec
* Raw data = 0.70...0.85 sec / 10M ints


* Homework #1, apply O(n log n) and 3 GHz, solve for C


* Homework #2, predict 1M or 100M timings
* Homework #3, compare with radix...
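
One possible way to set up homework #1 and #2, as a sketch under a deliberately naive model: ticks = C * n * log2(n), seconds = ticks / clock. The function names are hypothetical, and the model ignores cache and RAM effects entirely, so treat its predictions as a starting point for a benchmark, not a replacement.

```cpp
#include <cassert>
#include <cmath>

// Naive model: seconds = C * n * log2(n) / hz.
// Solve for C (ticks per "n log n step") from one measured timing.
double solve_c(double n, double seconds, double hz) {
    assert(n > 1.0 && seconds > 0.0 && hz > 0.0);
    return seconds * hz / (n * std::log2(n));
}

// Predict the sort time for another n with the same C.
double predict_seconds(double n, double c, double hz) {
    assert(n > 1.0 && c > 0.0 && hz > 0.0);
    return c * n * std::log2(n) / hz;
}
```

Plugging in the raw data (0.70...0.85 sec per 10M ints at 3 GHz) gives C of roughly 9...11 ticks. With that C the model predicts about 0.06...0.07 sec for 1M ints and about 8...10 sec for 100M ints; the larger case is where the missing RAM term is most likely to show.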


hash_find()


* RULE OF THUMB: ~20...60 M calls/sec


* K = short strings, 99.8% under 16 bytes
* V = ints, N = millions


* Implementations vary SEVERELY
* std::unordered_map << abseil ... << handmade


rwlock() readonly


* RULE OF THUMB: ~5...10 M locks/sec


* Raw data = ~33M locks/sec, uncontended 1 rd
* Raw data = ~9M locks/sec, 2 rd, 0 wr
* Raw data = ~6M locks/sec, 128 rd, 0 wr


* Homework #1: bench this vs mutex()


raw data 


| rd | wr | pthreads |
| 1 | 0 | 39.5 ms | 32.8 M/sec
| 2 | 0 | 230.8 ms | 8.7 M/sec
| 4 | 0 | 630.2 ms |
| 8 | 0 | 1227.7 ms |
| 16 | 0 | 2570.0 ms |
| 32 | 0 | 5090.7 ms |
| 64 | 0 | 11061.3 ms |
| 128 | 0 | 21830.8 ms | 5.8 M/sec


rwlock() rdwr


* RULE OF THUMB: ~3...6 M locks/sec
* RULE OF THUMB: ~2.5...0.1 M writes/sec


* Different benchmark, 2x writers at all times
* Readers slow down writers, as expected


raw data 


| rd | wr | pthreads |
| 1 | 2 | 739.8 ms | 4.0 M/sec
| 2 | 2 | 1245.1 ms | 3.2 M/sec
| 4 | 2 | 1428.0 ms | 4.2 M/sec (!)
| 8 | 2 | 1831.3 ms | 5.5 M/sec
| 16 | 2 | 3214.5 ms | 5.6 M/sec
| 32 | 2 | 5678.9 ms | 6.0 M/sec (!!!)
| 64 | 2 | 12132.8 ms | 5.4 M/sec
| 128 | 2 | 23299.6 ms | 5.6 M/sec


additions summary


malloc                   ~1.0 M/sec
qsort(int)               ~12 .. 14 M/sec
hash_find(short str)     ~20 .. 60 M/sec
rwlock() reads           ~3 .. 6 M/sec
rwlock() writes          ~0 .. 2 M/sec
decompress()             ~1 .. 3 GB/sec
compress()               ~0.1 .. 1.0 GB/sec


compress()


* RULE OF THUMB: ~1...3 GB/sec decompress
* RULE OF THUMB: ~0.1...1 GB/sec compress


* Dean2010 => "Zippy" (now "Snappy"), still alive
* In 2023, we have LZ4 and zstd


* Briefly: LZ4 for speed, zstd for quality, and
* Briefly: NEVER gzip, NEVER Snappy


that's not quite CPU 


the greatest beast
RAM!


* Maybe just my pet peeve, but IMO it's RAM
* Seemingly simple, lots of timings, some docs, but...


* Predicting RAM perf?! 17th CIRCLE OF HELL


* CPU = super complicated, but smth "works"


* RAM = seems innocuous, NOTHING "works"
* Yes, that's a rant


"Bad" news


* All those previous "CPU" benchmarks
* They are not CPU only
* They are CPU + RAM, all of them


* NB: they are ofc VALID
* You can't "predict" those based on CPU only
* You must account for RAM... or bench (as we did)


"Good" news


* What do we do when the reality is too complicated?
* What if simple models don't work?
* Easy, we use ML


* What do we do when theory just doesn't compute?
* What if simple models don't work?
* Easy, we use...


benchmarks 


"Useful" news


* How to benchmark?
* You MUST account for cache
* You MUST have enough data... or it's all cached


* L1 is 64 bytes/cacheline, 16-32 KB total "today"
* L2 is 2...4 MB total (per core) "today"
* Thus, 32 MB at the VERY least
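
A minimal sketch of what such a RAM benchmark might look like, assuming a buffer well past the 32 MB floor. All names here are hypothetical and this is not the talk's actual benchmark code. Note that the random loads are independent of each other, so the CPU may overlap them; a dependent pointer-chasing chain would measure something different.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Linear pass: sum every element in order (bandwidth-bound on a big buffer).
std::uint64_t linear_sum(const std::vector<std::uint32_t>& data) {
    std::uint64_t sum = 0;
    for (std::uint32_t v : data) sum += v;
    return sum;
}

// Random pass: `reads` loads at pseudo-random positions.
// xorshift64 keeps index generation cheap compared to a cache miss.
std::uint64_t random_sum(const std::vector<std::uint32_t>& data,
                         std::size_t reads, std::uint64_t seed) {
    assert(!data.empty() && seed != 0);
    std::uint64_t x = seed, sum = 0;
    for (std::size_t i = 0; i < reads; i++) {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        sum += data[x % data.size()];
    }
    return sum;
}

// Wall-clock seconds taken by one call of f().
template <typename F>
double seconds(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

A run would fill, say, a 64 MB buffer (16M `uint32_t` values), time both passes with `seconds(...)`, and divide bytes or reads by the elapsed time. Print or otherwise use the returned sums so the compiler cannot drop the loops.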


Good news 


* Aka statistics, the simplest form of ML ever
* Without further ado!


* RULE OF THUMB: ~40...50M "io"/sec random
* RULE OF THUMB: ~15 GB/sec linear


* Everywhere! 4th gen desktop, 11th gen laptop, Xeon...


Good news v.2 


* But, Apple M4 and LPDDR5 change everything


* RULE OF THUMB: ~120...150M "io"/sec random
* RULE OF THUMB: ~40...50 GB/sec linear


* Yes, that is 3 times faster
* No, most servers are not there yet


what have we learned?


Summary 


* There are CPU, RAM, and disk
* There are [Dean2010] constants
* There are [Aks2022] constants
* You can (and should) bench some of your own


* You can estimate system performance with those


* And get reasonably precise results too, but


[screenshot: browser error page, "This site can't be reached", ERR_CONNECTION_TIMED_OUT]


That's it


* Teaser #1, csv_parse()
* Teaser #2, SELECT SUM(a+b*3)


* Questions and comments?
* Homework and benchmarks?
* Very interesting, we want a workshop?
* Complex and boring, never again?


@shodanium


For the brave
telegram
Or we could, you know,


TALK


| Operation | Note |
| L1 cache reference | |
| Branch mispredict | |
| L2 cache reference | 14x L1 cache |
| Mutex lock/unlock | |
| Main memory reference | 20x L2 cache, 200x L1 cache |
| Compress 1K bytes with Zippy | |
| Send 1K bytes over 1 Gbps network | |
| Read 4K randomly from SSD* | ~1GB/sec SSD |
| Read 1 MB sequentially from memory | |
| Round trip within same datacenter | |
| Read 1 MB sequentially from SSD* | ~1GB/sec SSD, 4X memory |
| Disk seek | 20x datacenter roundtrip |
| Read 1 MB sequentially from disk | 80x memory, 20X SSD |
| Send packet CA->Netherlands->CA | |


dean-2010 oldies goldies


Google 


| Operation | ns | Rate |
| L1 cache reference | 0.33 | 3000 Mops/s |
| Branch mispredict | ??? | |
| hash lookup() | 16..30 | 20..60 Mops/sec |
| Main memory reference | 20..25 | 40..50+ Mops/s |
| Rwlock (mutex?) lock/unlock | 100..250 | 4..10 Mops/s |
| malloc() | 1,000 | ~1M allocs/sec ??? |
| [De]compress 1K bytes with lz4/zstd | ??? | |
| Send 1K bytes over 1 Gbps network | 10,000 | 100 MB/sec |
| Read 4K randomly from SSD* | 20..200,000 | 5..50 Kops/sec |
| Read 1 MB sequentially from memory | 66,000 | 15+ GB/sec |
| Read 1 MB sequentially from SSD* | 333,000 | 3+ GB/sec |
| Round trip within same datacenter | 500,000 | 0.5 ms |
| HDD disk seek | 5..10,000,000 | 100..200 ops/sec |
| Read 1 MB sequentially from HDD | 10,000,000 | 100+ MB/sec |
| qsort() 1M ints | 70,000,000 | ~15 Mints/sec |
| Send packet CA->Netherlands->CA | 150,000,000 | 150 ms |


shodan-2022 update, TRUST NO ONE, INSERT YOUR VERY OWN BLOCKS HERE!

SUM(a+b*3) Journey 


* linear.cpp
    int64_t r = 0; // sum(a+b*3)
    for (int i = 0; i < N; i++) // N = 40M
        r += (rows[i].a + rows[i].b * 3); // row = 8 .. 64 bytes
* Sphinx
* MySQL
* ClickHouse


SUM(a+b*3) vs 64 bytes 


| linear.cpp, 64b, est RAM2010 | 0.63 sec |
| linear.cpp, 64b, est RAM2022 | 0.17 sec |
| linear.cpp, 64b, actual 6240R | 0.19 sec |
| Sphinx, 64b, actual 40x | 1.28 sec |
| MySQL, 64b, actual 40x | 10.72 sec (!!!) |

SUM(a+b*3) vs 8 bytes 


| linear.cpp, 8b, est RAM | 0.022 sec |
| linear.cpp, 8b, est RAM+CPU | 0.075..0.130 sec |
| linear.cpp, 8b, actual 6240R | 0.024 sec |
| ClickHouse, 8b, actual 4x | 0.020 sec (!!!) |


WHY???


No magic there 


1. Automatic SIMD and unroll, and pretty good ones
=> 4.5 instr/value, ~0.048 sec est


2. Parallel execution, IPC > 1 (est 2.5)
=> ??? actual ticks, ??? sec


3. Parallel memory loads (read-ahead)
=> max(ram,cpu) vs ram+cpu => ~0.022 sec


4. Compression in ClickHouse 
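
The max(ram,cpu) vs ram+cpu point from item 3 can be written down as a tiny model. This is only a sketch: the function names are hypothetical, and dividing the one-instruction-per-tick CPU estimate by an assumed IPC is a simplification of mine, not a figure from the talk.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// No overlap: memory loads and arithmetic happen one after another.
double serial_estimate(double ram_sec, double cpu_sec) {
    return ram_sec + cpu_sec;
}

// Full overlap: read-ahead hides whichever of the two is cheaper.
double overlapped_estimate(double ram_sec, double cpu_sec) {
    return std::max(ram_sec, cpu_sec);
}

// Assumption (mine): CPU time shrinks proportionally with instructions per clock.
double scale_by_ipc(double cpu_sec_at_ipc1, double ipc) {
    assert(ipc > 0.0);
    return cpu_sec_at_ipc1 / ipc;
}
```

Taking ~0.022 sec for RAM and ~0.048 sec for CPU at one instruction per tick, the serial model gives 0.070 sec. With the estimated IPC of 2.5 and full overlap, the model returns 0.022 sec: the loop becomes RAM-bound, which is in line with the measured 0.024 sec.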


4x SSE loop 


    movq      xmm0, xmm4                     // this is 3.. i guess
    movq      xmm1, mmword ptr [rcx+rax-8]   // 1x unroll
    pmulld    xmm1, xmm0
    movq      xmm0, mmword ptr [rax-8]
    paddd     xmm1, xmm0
    pmovsxdq  xmm1, xmm1
    paddq     xmm2, xmm1

    movq      xmm0, xmm4
    movq      xmm1, mmword ptr [rcx+rax]     // 2x unroll
    pmulld    xmm1, xmm0
    movq      xmm0, mmword ptr [rax]
    paddd     xmm1, xmm0
    pmovsxdq  xmm1, xmm1
    paddq     xmm1, xmm3
    movdqa    xmm3, xmm1

    lea       rax, [rax+10h]                 // add rax,16
    sub       Fax, 1
    jne       00007FF6C4D71C51


